Heuristic Optimization of SPARQL queries over Column-Store DBMS
Abstract
During the last decade we have witnessed a tremendous increase in the amount of semantic data available on the Web in almost every field of human activity. More and more corporate, governmental, and even user-generated datasets break the walls of “private” management within their production site, are published, and become available to potential data consumers, i.e., applications and services, individual users, and communities. In this context, the Web of Data extends the current Web to a global data space connecting data from diverse domains. This adds value for decision support and business intelligence applications, and enables new types of services that operate on top of an unbounded, global data space rather than on a fixed set of data sources as in Web 2.0 mashups. A central issue in this respect is the manipulation and use of data based on their meaning, which requires effective and efficient support for storing, querying, and manipulating semantic RDF data, the lingua franca of Linked Open Data and hence the default data model for the Web of Data. This thesis focuses on the problem of scalable processing and optimization of semantic queries expressed in SPARQL using modern relational engines. Existing native or SQL-based engines for processing SPARQL queries rely heavily on statistics about the stored RDF graphs, as well as on adequate cost-based planning algorithms, to optimize complex join queries. Extensive data statistics are quite expensive to compute and maintain for large-scale, evolving semantic data on the Web. The main challenge in this respect is to devise heuristics-based query optimization techniques that generate near-optimal execution plans without any knowledge of the underlying datasets. For this reason we propose the first heuristics-based SPARQL planner (HSP), which is capable of exploring the syntactic variations of triple patterns in a query in order to choose a near-optimal execution plan without the use of a cost model.
More precisely, HSP tries to produce plans that maximize the number of merge joins over the ordered variables shared among the triple patterns of a query. It relies on various heuristics to decide which ordered variables will be used in selections and joins, as well as which underlying access paths (essentially sorted triple relations in MonetDB) will be exploited for evaluating the triple patterns. Furthermore, we have implemented HSP plans on top of the MonetDB column-based DBMS. We have paid particular attention to the efficient mapping of HSP logical plans onto the underlying MonetDB query execution engine by translating them into MonetDB’s physical algebra (MAL). Finally, we have experimentally compared the quality and execution time of the plans produced by HSP against those of the state-of-the-art Cost-based Dynamic Programming (CDP) algorithm employed by RDF-3X, using both synthetically generated and real RDF datasets. In all queries of our workload, HSP produces plans with the same number of merge and hash joins as CDP. The differences lie in the ordered variables employed and in the execution order of the joins, both of which essentially affect the size of intermediate results. With the exception of queries that are not substantially different in their syntax, HSP plans executed on MonetDB outperform those of CDP executed in RDF-3X by up to three orders of magnitude.
Supervisor: Vassilis Christophides, Professor
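The idea of favoring merge joins over variables shared among triple patterns can be illustrated with a small sketch. This is not the HSP algorithm itself: the function names, the tie-breaking rule, and the "constants first, then the join variable" rule for picking a sorted triple ordering are all assumptions made for illustration, and the abstract does not spell out HSP's actual heuristics.

```python
from collections import Counter

def is_var(term):
    # SPARQL variables are written with a leading "?".
    return term.startswith("?")

def shared_variables(patterns):
    """Map each variable to the number of triple patterns it occurs in,
    keeping only variables that occur in at least two patterns."""
    counts = Counter()
    for tp in patterns:
        for v in {t for t in tp if is_var(t)}:
            counts[v] += 1
    return {v: n for v, n in counts.items() if n >= 2}

def pick_merge_variable(patterns):
    """Assumed heuristic: prefer the variable shared by the most patterns,
    since every pattern containing it can take part in a merge join.
    Ties are broken alphabetically to keep the result deterministic."""
    shared = shared_variables(patterns)
    if not shared:
        return None
    return min(shared, key=lambda v: (-shared[v], v))

def access_path(tp, var):
    """Assumed rule for choosing a sorted triple relation (e.g. "pos"):
    constant positions first, then the join variable, then the rest,
    so that matching triples are read already sorted on the join variable."""
    names = "spo"
    const = [names[i] for i, t in enumerate(tp) if not is_var(t)]
    join = [names[i] for i, t in enumerate(tp) if t == var]
    rest = [names[i] for i, t in enumerate(tp) if is_var(t) and t != var]
    return "".join(const + join + rest)

# Hypothetical query: articles, their creators, and the creators' names.
patterns = [
    ("?a", "rdf:type", "bench:Article"),
    ("?a", "dc:creator", "?p"),
    ("?p", "foaf:name", "?n"),
]
merge_var = pick_merge_variable(patterns)
paths = [access_path(tp, merge_var) for tp in patterns]
```

In this example the first two patterns share `?a`, so both can be read from relations sorted on the subject and combined with a merge join; the third pattern does not contain `?a` and would have to be joined on `?p` afterwards. Everything here is decided from the query syntax alone, without statistics about the data.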
Similar resources
Massive-Scale RDF Processing Using Compressed Bitmap Indexes
The Resource Description Framework (RDF) is a popular data model for representing linked data sets arising from the web, as well as large scientific data repositories such as UniProt. RDF data intrinsically represents a labeled and directed multi-graph. SPARQL is a query language for RDF that expresses subgraph pattern-finding queries on this implicit multigraph in a SQL-like syntax. SPARQL quer...
An Experimental Comparison of RDF Data Management Approaches in a SPARQL Benchmark Scenario
Efficient RDF data management is one of the cornerstones in realizing the Semantic Web vision. In the past, different RDF storage strategies have been proposed, ranging from simple triple stores to more advanced techniques like clustering or vertical partitioning on the predicates. We present an experimental comparison of existing storage strategies on top of the SP²Bench SPARQL performance benc...
Optimizing SPARQL queries over the Web of Linked Data
The web of linked data represents a globally distributed dataspace. It can be queried with SPARQL whose execution takes place by asynchronously traversing the RDF links to discover data sources at run-time. However, the optimization of SPARQL queries over the web of data remains a challenge and in this paper we present an approach addressing this problem. The proposed approach works in two-phas...
Database System Support of Simulation Data
Supported by increasingly efficient HPC infrastructure, numerical simulations are rapidly expanding to fields such as oil and gas, medicine and meteorology. As simulations become more precise and cover longer periods of time, they may produce files with terabytes of data that need to be efficiently analyzed. In this paper, we investigate techniques for managing such data using an array DBMS. W...
Evaluation of SPARQL Property Paths via Recursive SQL
Property paths, a part of the proposed SPARQL 1.1 standard, allow for non-trivial navigation in RDF graphs. We investigate the evaluation of SPARQL queries with property paths in a relational RDF store. We propose a translation of SPARQL property paths into recursive SQL and discuss possible optimization strategies.
Publication date: 2011